Address residual P1+P2s from re-audit of PR #412#422
Merged
Conversation
The restored CI reviewer surfaced findings the degraded reviewer missed across all 5 prior rounds on PR #412: P1 (REGISTRY + code comment): the claim that "R does not ship per-path predict_het on placebos either, so parity is preserved by deferral" contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)` dispatcher actually does - it forwards `predict_het` into each per-path `did_multiplegt_main` call along with `placebo`, so R may emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite both surfaces (chaisemartin_dhaultfoeuille_results.py code comment and REGISTRY.md DataFrame-integration paragraph) as an explicit Python- side deferral rather than a verified R-parity. Add a TODO row to track validating R's actual placebo predict_het output and either implementing parity or documenting the deviation explicitly. P2 (REGISTRY rtol claim): the per-path heterogeneity R-parity paragraph claimed "rtol ~1e-6 on point estimates AND SE", but the parity tests use BETA_RTOL=1e-6 and SE_RTOL=1e-5 (one decade looser on SE). Split the claim into the two separate tolerances and note the WLS-denominator/cohort-recentering numerical drift that motivates the looser SE bound. P2 (replicate-weight df_survey refresh): the existing test only checked finite SE; it would have passed if the new dedicated heterogeneity refresh loop failed to recompute t_stat / p_value / conf_int at the final df_survey. Strengthen the test to call `safe_inference(beta, se, df=df_survey)` on the first finite entry and assert the stored inference fields match - this anti-regression covers the dedicated post-call refresh added for path_heterogeneity_ effects. P2 (paths_of_interest survey gap): the documented composability of `paths_of_interest + heterogeneity + survey_design` was not regression- locked - all existing survey-specific tests used `by_path=k`. Add test_paths_of_interest_heterogeneity_survey_design_analytical (verify analytical Binder TSL fits, selector ordering preserved, finite SE per populated (path, l)) and test_paths_of_interest_heterogeneity_ survey_n_bootstrap_gate (verify the multiplier-bootstrap gate applies under paths_of_interest too). No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed in source - documentation accuracy plus expanded regression coverage only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
This was referenced May 14, 2026
igerber
added a commit
that referenced
this pull request
May 15, 2026
Two informational findings from the latest review: 1. Stale `to_dataframe(level="by_path")` docstring at `chaisemartin_dhaultfoeuille_results.py:1530-1558` still claimed placebo `het_*` columns are NaN. Updated to document the post-#422 contract: positive-horizon AND negative-horizon (placebo) rows are both populated when `placebo=True + heterogeneity=` are co-set; placebo rows under `survey_design` remain NaN with a fit-time UserWarning. 2. JSON golden-fixture type instability for empty `placebo_predict_het` / `placebo_horizons` slots. R's `jsonlite::toJSON` serializes plain `list()` as `[]` (array) but populated named lists as `{}` (object), so consumers iterating `.items()` on the slot saw different shapes across scenarios. Fixed both ends: - R-side: extractors initialize empty slots with `structure(list(), names = character(0))` which jsonlite serializes as `{}` even when empty. Verified across 4 scenarios (20, 21, 22, 23) — all `placebo_predict_het` / `placebo_horizons` slots now serialize as objects regardless of population. - Python-side: added `_as_dict` helper in the parity test module as a defensive backstop coercing any non-dict (None / [] / missing) to {}. Used at the two call sites that read optional placebo slots so consumers can call `.items()` uniformly. The golden JSON regenerated; type-stable across all scenarios (verified via `jq` on each scenario's predict_het type). 314 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber
added a commit
that referenced
this pull request
May 15, 2026
…tract Two stale in-code documentation surfaces from the latest review: 1. `chaisemartin_dhaultfoeuille.py:980-989` — the `heterogeneity` parameter docstring on `fit()` still said "post-treatment regressions only (no placebo regressions)". Updated to document the post-#422 contract: per-horizon OLS regressions on forward AND backward (placebo) horizons when `placebo=True`; survey_design composes with forward horizons but warns + skips backward horizons until the pre-period cell allocator is derived. 2. `chaisemartin_dhaultfoeuille_results.py:1279-1286` — the heterogeneity summary note didn't mention the survey forward-only fallback. Extended the note to cover the gating semantics so users reading `result.summary()` under `survey_design + heterogeneity` know what they're getting. Both surfaces now match the contract already documented in the API rst, REGISTRY, and CHANGELOG. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDL77
pushed a commit
to TDL77/diff-diff
that referenced
this pull request
May 18, 2026
A holistic codex review of the merged igerber#412 + cleanup igerber#422 state surfaced three documentation/test gaps that the per-PR cleanup review path could not see (it only scopes to the cleanup diff). All three are at-most P3 in severity but each is real claim-vs-coverage drift. 1. REGISTRY's top-level `Note (Phase 3 by_path ...)` `to_dataframe( level="by_path")` schema list omits the `het_*` columns (`het_beta`, `het_se`, `het_t_stat`, `het_p_value`, `het_conf_int_lower`, `het_conf_int_upper`) that `_to_dataframe` has always emitted since the Phase 5 heterogeneity wave landed. Add them to the schema list so the registry contract matches the implementation. 2. The two new parity tests (`TestDCDHDynRParityHeterogeneity`, `TestDCDHDynRParityByPathHeterogeneity`) assert only `beta` and `se` from the R golden payload, leaving `t_stat`, `p_value`, `conf_int`, and `n_obs` unpinned. A regression in the inference extraction or final-`df_survey` refresh could ship while parity still passes. Pin `t_stat` at `SE_RTOL` (invariant to critical- value distribution since `t = beta / se`) and `n_obs` exactly. 3. While extending the parity assertions, surfaced a real Python-vs-R structural deviation that was undocumented: `_compute_heterogeneity_test` passes `df=None` to `safe_inference`, so Python uses the normal Z critical value (~1.96) for `p_value` and `conf_int`. R `did_multiplegt_dyn(..., predict_het)` uses the t-distribution with df = n - k from the WLS regression. The structural gap produces ~0.1-2% rtol gaps on CIs and p-values that exceed `SE_RTOL` (verified empirically on the parity fixture: CI gap ~0.17% on h=1). Document the deviation in the heterogeneity R-parity Note. Pin only `beta`, `se`, `t_stat`, `n_obs` in the parity tests; `p_value` and `conf_int` parity intentionally skipped. Add a TODO row tracking the optional df-threading work. No estimator behavior, weighting, variance/SE computation, or default-statistical surface changed - documentation accuracy plus expanded regression coverage only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDL77
pushed a commit
to TDL77/diff-diff
that referenced
this pull request
May 18, 2026
Closes TODO igerber#422 + pilot-412 in a single PR (same surface, same R fixture, same parity story). Phase 0 probe verified R behavior: did_multiplegt_dyn(by_path, predict_het, placebo) emits per-path heterogeneity OLS results on backward (placebo) horizons via R's per-by_level dispatcher (DIDmultiplegtDYN:::did_multiplegt_main placebo block at the `effect = matrix(-i, ...)` rbind site). New scenario 22 in benchmarks/R/generate_dcdh_dynr_test_values.R captures this with predict_het=list("het_x", c(-1)) — the c(-1) sentinel triggers "compute heterogeneity for ALL forward (1..effects) AND ALL placebo (1..placebo) positions" per the R source path read at script time. Phase 1A implementation (non-survey): _compute_heterogeneity_test gains a placebo: int = 0 parameter and iterates forward (1..L_max) and backward (-1..-placebo) horizons in a single loop. Explicit `if out_idx < 0: continue` eligibility guard prevents numpy negative-index silent wrap on N_mat[g, out_idx] when F_g - 1 + l_h < 0. _compute_path_heterogeneity_test forwards the param; fit() passes placebo=L_max if self.placebo else 0 to both global and per-path call sites. to_dataframe(level="by_path") placebo rows now read het_* values from path_heterogeneity_effects negative-int keys (mirroring the existing path_placebo_event_study negative-key convention) instead of the pre-PR hardcoded NaN-fill. Survey gate: when survey_design is active AND placebo > 0 + heterogeneity is requested, _compute_heterogeneity_test raises NotImplementedError eagerly with a documented message. The Binder TSL cell-period allocator's REGISTRY justification is tied to post-period attribution; backward-horizon attribution puts ψ_g mass on a pre-period cell, which is a separate library-extension claim that needs its own derivation. Forward-horizon predict_het + survey continues to work unchanged. Pre-period allocator derivation tracked as a new follow-up TODO row. Phase 2 (df threading): _compute_heterogeneity_test now passes df = n_obs - n_params to safe_inference on the non-survey OLS path, matching R did_multiplegt_dyn(predict_het=...)'s t-distribution inference (qt(0.975, df.residual(model)) site). Pre-PR Python used df=None (Z critical), producing 0.1-2% rtol gaps on p_value/conf_int vs R. Existing forward-horizon parity tests now pin t/p/CI at INFERENCE_RTOL=1e-4 (was unpinned). Rank- deficient designs use design.shape[1] as df denominator (pre-drop column count); fully rank-deficient is NaN-short-circuited by the existing guard. Near-rank-deficient edge case tracked as a new Low TODO follow-up. R parity: scenario 22 (multi_path_reversible_predict_het_with_placebo, placebo=2, effects=3, by_path=3) pinned at BETA_RTOL=1e-6/SE_RTOL=1e-5 for beta/se/t_stat/n_obs and INFERENCE_RTOL=1e-4 for p_value/conf_int across 3 paths × (3 forward + 2 placebo) = 15 horizons. Cross-surface tests (TestByPathPredictHetPlacebo): placebo het column population, survey-gate NotImplementedError, forward+survey anti-regression, out_idx<0 eligibility guard, single-path telescope (path_heterogeneity_effects[(only_path,)] == heterogeneity_effects bit- exactly), summary rendering. The two existing local-invariant tests (test_*_inference_matches_safe_inference) refactored to verify SE-derivation wiring (t_stat=beta/se, conf_int symmetric around beta, p_value in [0,1]) without back-deriving n_params. REGISTRY: heterogeneity Z-vs-t deviation note replaced with positive "R parity (post-2026-05-15 df threading)" framing including the rank- deficient caveat. New "Per-path placebo heterogeneity" Note documents the R parity, syntax requirements (c(-1) sentinel), survey gate, and test anchors. CHANGELOG entry under [Unreleased]. llms-full.txt by_path entry extended with placebo het composition + survey-gate mention. API rst extended with the same. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Audit follow-up to PR #412. The restored CI reviewer surfaced findings the degraded reviewer missed across all 5 prior rounds:
P1 (REGISTRY + code comment) - the claim that "R does not ship per-path `predict_het` on placebos either, so parity is preserved by deferral" contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)` dispatcher does: it forwards `predict_het` into each per-path `did_multiplegt_main` call alongside `placebo`, so R may emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite both surfaces as an explicit Python-side deferral, NOT a verified R-parity. TODO row added to track validating R's actual output and either implementing parity or documenting the deviation explicitly.
P2 (REGISTRY rtol claim) - the per-path heterogeneity R-parity paragraph claimed `rtol ~1e-6 on point estimates AND SE`, but parity tests use `BETA_RTOL=1e-6` and `SE_RTOL=1e-5`. Split the claim and note the WLS-denominator/cohort-recentering numerical drift motivating the looser SE bound.
P2 (replicate-weight df_survey refresh test gap) - the existing test `test_per_path_heterogeneity_replicate_weights_propagates_n_valid` would have passed if the new dedicated refresh loop failed to recompute `t_stat` / `p_value` / `conf_int` at the final `df_survey`. Strengthen to call `safe_inference(beta, se, df=df_survey)` on the first finite entry and assert the stored inference fields match.
P2 (paths_of_interest survey gap) - the documented composability of `paths_of_interest + heterogeneity + survey_design` was not regression-locked (all survey-specific tests used `by_path=k`). Add two new tests: analytical Binder TSL coverage with selector-ordering preservation, and the multiplier-bootstrap gate under `paths_of_interest`.
No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed in source - documentation accuracy plus expanded regression coverage only.
Test plan
🤖 Generated with Claude Code